	MULTIDIMENSIONAL ANALYSIS IN EVALUATING A SIMULATION
	       OF PARANOID THOUGHT PROCESSES

               K.M. COLBY AND F.D. HILF

	Once a simulation model reaches a stage of intuitive adequacy, a model builder should consider using more stringent evaluation procedures relevant to the model's purposes. For example, if the model is to serve as a training device, then a simple evaluation of its pedagogic effectiveness would be sufficient. But when the model is proposed as an explanation of a psychological process, more is demanded of the evaluation procedure.
	We shall not describe our model of paranoid processes here. A description can be found in the literature (Colby, Weber, and Hilf, 1971). We shall concentrate on the evaluation problem, which asks "how good is the model?" or "how close is the correspondence between the behavior of the model and the phenomena it is intended to explain?" Turing's Test has often been suggested as an aid in answering this question.
	It is very easy to become confused about Turing's Test. In part this is due to Turing himself, who introduced the now-famous imitation game in a paper entitled COMPUTING MACHINERY AND INTELLIGENCE (Turing, 1950). A careful reading of this paper reveals that there are actually two imitation games, the second of which is commonly called Turing's test.
	In the first imitation game two groups of judges try to determine which of two interviewees is a woman. Communication between judge and interviewee is by teletype. Each judge is initially informed that one of the interviewees is a woman and one a man who will pretend to be a woman. After the interview, the judge is asked what we shall call the woman-question, i.e., which interviewee was the woman? Turing does not say what else the judge is told, but one assumes the judge is NOT told that a computer is involved, nor is he asked to determine which interviewee is human and which is the computer. Thus, the first group of judges would interview two interviewees: a woman and a man pretending to be a woman.
	The second group of judges would be given the same initial instructions, but unbeknownst to them, the two interviewees would be a woman and a computer programmed to imitate a woman. Both groups of judges play this game until sufficient statistical data are collected to show how often the right identification is made. The crucial question then is: do the judges decide wrongly AS OFTEN when the game is played with man and woman as when it is played with a computer substituted for the man? If so, then the program is considered to have succeeded in imitating a woman as well as a man imitating a woman. For emphasis we repeat: in asking the woman-question in this game, judges are not required to identify which interviewee is human and which is machine.
	Later in his paper Turing proposes a variation of the first game. In the second game one interviewee is a man and one is a computer. The judge is asked to determine which is the man and which is the machine; we shall call this the machine-question. It is this version of the game which is commonly thought of as Turing's test. It has often been suggested as a means of validating computer simulations of psychological processes.
	In the course of testing a simulation (PARRY) of paranoid linguistic behavior in a psychiatric interview, we conducted a number of Turing-like indistinguishability tests (Colby, Hilf, Weber, and Kraemer, 1972). We say `Turing-like' because none of them consisted of playing the two games described above. We chose not to play these games for a number of reasons, which can be summarized by saying that they do not meet modern criteria for good experimental design. In designing our tests we were primarily interested in learning more about developing the model. We did not believe the simple machine-question to be a useful one in serving the purpose of progressively increasing the credibility of the model, but we investigated a variation of it to satisfy the curiosity of colleagues in artificial intelligence.
	In this design eight psychiatrists interviewed two patients by teletype, using the technique of machine-mediated interviewing, which involves what we term "non-nonverbal" communication since non-verbal cues are made impossible (Hilf, 1972). Each judge interviewed two `patients', one being PARRY and one a hospitalized paranoid patient. The interviewers were not informed that a simulation was involved, nor were they asked to identify which was the machine. Their task was to conduct a diagnostic psychiatric interview and rate each response from the `patients' along a 0-9 scale of paranoidness, 0 meaning none and 9 the highest degree. Transcripts of these interviews, without the interviewers' ratings, were then used in various experiments in which randomly selected expert judges evaluated the transcripts. For example, in one experiment it was found that patients and model were indistinguishable along the dimension of paranoidness.
	To ask the machine-question, we sent interview transcripts, one with a patient and one with PARRY, to 100 psychiatrists randomly selected from the Directory of American Specialists and the Directory of the American Psychiatric Association. Of the 41 replies, 21 (51%) made the correct identification while 20 (49%) were wrong. Based on this random sample of 41 psychiatrists, the 95% confidence interval is between 35.9% and 66.5%, a range which is close to chance. (Our statistical consultant was Dr. Helena C. Kraemer, Research Associate in Biostatistics, Department of Psychiatry, Stanford University.)
	Psychiatrists are considered expert judges of patient interview behavior, but they are unfamiliar with computers. Hence we conducted the same test with 100 computer scientists randomly selected from the membership list of the Association for Computing Machinery (ACM). Of the 67 replies, 32 (48%) were right and 35 (52%) were wrong. Based on this random sample of 67 computer scientists, the 95% confidence interval ranges from 36% to 60%, again close to a chance level.
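	(The reported intervals are consistent with the standard normal approximation to a binomial proportion. The short Python sketch below is purely illustrative and is not part of the original analysis; it simply reconstructs the two intervals from the reply counts.)

    from math import sqrt

    def proportion_ci(correct, n, z=1.96):
        """Normal-approximation 95% confidence interval for a binomial
        proportion of correct identifications, returned in percent."""
        p = correct / n
        half_width = z * sqrt(p * (1 - p) / n)
        return 100 * (p - half_width), 100 * (p + half_width)

    # 41 psychiatrist replies, 21 correct identifications
    print(proportion_ci(21, 41))    # approximately (35.9, 66.5)

    # 67 computer-scientist replies, 32 correct identifications
    print(proportion_ci(32, 67))    # approximately (35.8, 59.7), reported as 36 to 60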
	Thus the answer to this machine-question, "can expert judges, psychiatrists and computer scientists, using teletyped transcripts of psychiatric interviews, distinguish between paranoid patients and a simulation of paranoid processes?", is "No". But what do we learn from this? It is some comfort that the answer was not "yes" and the null hypothesis (no differences) failed to be rejected, especially since statistical tests are somewhat biased in favor of rejecting the null hypothesis (Meehl, 1967). Yet this answer does not tell us what we would most like to know, i.e., how to improve the model. Simulation models do not spring forth in a complete, perfect, and final form; they must be gradually developed over time. Perhaps we might obtain a "yes" answer to the machine-question if we allowed a large number of expert judges to conduct the interviews themselves rather than study transcripts of other interviewers. Such an answer would indicate that the model must be improved, but unless we systematically investigated how the judges succeeded in making the discrimination, we would not know what aspects of the model to work on. The logistics of such a design are immense, and obtaining a large N of judges for sound statistical inference would require an effort disproportionate to the information yield.
	A more efficient and informative way to use Turing-like tests is to ask judges to make ordinal ratings along scaled dimensions from teletyped interviews. We shall term this approach asking the dimension-question. One can then compare the scaled ratings received by the patients and by the model to determine precisely where and by how much they differ. Model builders strive for a model which shows indistinguishability along some dimensions and distinguishability along others. That is, the model converges on what it is supposed to simulate and diverges from what it is not.
	We mailed paired-interview transcripts to another 400 randomly selected psychiatrists, asking them to rate the responses of the two `patients' along certain dimensions. The judges were divided into groups, each judge being asked to rate the responses of each input-output (I-O) pair in the interviews along four dimensions. The total number of dimensions in this test was twelve: linguistic noncomprehension, thought disorder, organic brain syndrome, bizarreness, anger, fear, ideas of reference, delusions, mistrust, depression, suspiciousness, and mania. These are dimensions which psychiatrists commonly use in evaluating patients.
	Table 1 shows there were significant differences, with PARRY receiving higher scores along the dimensions of linguistic noncomprehension, thought disorder, bizarreness, anger, mistrust, and suspiciousness. On the dimension of delusions the patients were rated significantly higher. There were no significant differences along the dimensions of organic brain syndrome, fear, ideas of reference, depression, and mania.
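	(The dimension-question analysis comes down to comparing, dimension by dimension, the mean ratings given to the model's responses with those given to the patients'. The Python sketch below shows one plausible form of such a comparison, a two-sample t-test of the kind whose t values appear in Table 3; the ratings shown are invented for illustration and are not the study's data.)

    from statistics import mean
    from scipy import stats

    # Hypothetical 0-9 ratings on a single dimension (e.g. thought disorder),
    # pooled over judges; the real analysis covers all twelve dimensions.
    parry_ratings   = [6, 7, 5, 8, 6, 7, 5, 6]     # illustrative values only
    patient_ratings = [3, 4, 2, 5, 3, 4, 3, 2]     # illustrative values only

    # Welch's two-sample t-test on the mean ratings (no equal-variance assumption).
    t, p = stats.ttest_ind(parry_ratings, patient_ratings, equal_var=False)
    print(f"mean PARRY = {mean(parry_ratings):.2f}, "
          f"mean patients = {mean(patient_ratings):.2f}, "
          f"t = {t:.2f}, p = {p:.4f}")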
	While tests asking the machine-question indicate indistinguishability at the gross level, a study of the finer structure of the model's behavior through ratings along scaled dimensions shows statistically significant differences between patients and model. These differences help the model builder by suggesting which aspects of the model must be modified and improved for it to be considered an adequate simulation of the class of paranoid patients it is intended to simulate. For example, it is clear that PARRY's language comprehension must be improved. Once this has been implemented, a future test will tell us whether improvement has occurred and by how much in comparison to the earlier version. Successive identification of particular areas of failure in the model permits their improvement and the development of more adequate versions of the model.
	Further evidence that the machine-question is too coarse and insensitive a test comes from the following experiment. In this test we constructed a random version of the paranoid model which utilized PARRY's output statements but expressed them randomly, no matter what the interviewer said. Two psychiatrists conducted interviews with this model, transcripts of which were paired with patient interviews and sent to 200 randomly selected psychiatrists, who were asked both the machine-question and the dimension-question. Of the 69 replies, 34 (49%) were right and 35 (51%) wrong. Based on this random sample of 69 psychiatrists, the 95% confidence interval ranges from 39% to 63%, again indicating a chance level. However, as shown in Table 2, significant differences appear along the dimensions of linguistic noncomprehension, thought disorder, and bizarreness, with RANDOM-PARRY rated higher. On these particular dimensions we can construct a continuum in which the random version represents one extreme and the actual patients the other. Our (nonrandom) PARRY lies somewhere between these two extremes, indicating that it performs significantly better than the random version but still requires improvement before being indistinguishable from patients (see Fig. 1). Table 3 presents t values for differences between mean ratings of PARRY and RANDOM-PARRY. (See Table 2 and Fig. 1 for the mean ratings.)
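	(On these dimensions the comparison reduces to ordering the three conditions by mean rating, as in Fig. 1. The fragment below illustrates that ordering with made-up numbers; they are not the means reported in Table 2 or Fig. 1.)

    # Hypothetical mean ratings on linguistic noncomprehension (0-9 scale);
    # illustrative values only, not those of Table 2 or Fig. 1.
    mean_rating = {
        "patients":     1.5,
        "PARRY":        3.0,
        "RANDOM-PARRY": 5.5,
    }

    # Order the three conditions along the continuum of Fig. 1.
    continuum = sorted(mean_rating, key=mean_rating.get)
    print(" < ".join(continuum))    # patients < PARRY < RANDOM-PARRY

    # PARRY should fall between the two extremes on such a dimension.
    assert mean_rating["patients"] < mean_rating["PARRY"] < mean_rating["RANDOM-PARRY"]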
	Thus it can be seen that such a multidimensional analysis provides yardsticks for measuring the adequacy of this or any other dialogue simulation model along the relevant dimensions.
	We conclude that when model builders want to conduct tests which indicate in which direction progress lies and to obtain a measure of whether progress is being achieved, the way to use Turing-like tests is to ask expert judges to make ratings along multiple dimensions that are essential to the model. Useful tests do not prove a model; they probe it for its strengths and weaknesses. Simply asking the machine-question yields little information relevant to what the model builder most wants to know, namely, along which dimensions the model must be improved.

		REFERENCES

[1]  Colby, K.M., Weber, S., and Hilf, F.D., 1971. Artificial paranoia. ARTIFICIAL INTELLIGENCE, 2, 1-25.

[2]  Colby, K.M., Hilf, F.D., Weber, S., and Kraemer, H.C., 1972. Turing-like indistinguishability tests for the validation of a computer simulation of paranoid processes. ARTIFICIAL INTELLIGENCE, 3, 199-221.

[3]  Hilf, F.D., 1972. Non-nonverbal communication and psychiatric research. ARCHIVES OF GENERAL PSYCHIATRY, 27, 631-635.

[4]  Meehl, P.E., 1967. Theory testing in psychology and physics: a methodological paradox. PHILOSOPHY OF SCIENCE, 34, 103-115.

[5]  Turing, A., 1950. Computing machinery and intelligence. Reprinted in: COMPUTERS AND THOUGHT (Feigenbaum, E.A. and Feldman, J., eds.). McGraw-Hill, New York, 1963, pp. 11-35.